Abstract



In this work we consider the stochastic minimization of nonsmooth convex loss functions, 
a central problem in machine learning. We propose a novel algorithm called Accelerated 
Nonsmooth Stochastic Gradient Descent (ANSGD), which exploits the structure of common 
nonsmooth loss functions to achieve optimal convergence rates for a class of problems 
including SVMs. It is the first stochastic algorithm that can achieve the optimal $O(1/t)$
rate for minimizing nonsmooth loss functions (with strong convexity). The fast rates are 
confirmed by empirical comparisons, in which ANSGD significantly outperforms previous 
subgradient descent algorithms including SGD. 

1. Introduction 

Nonsmoothness is a central issue in machine learning computation, as many important 
methods minimize nonsmooth convex functions. For example, using the nonsmooth hinge 
loss yields sparse support vector machines; regressors can be made robust to outliers by 
using the nonsmooth absolute loss rather than the squared loss; the $\ell_1$-norm is widely used
in sparse reconstructions. In spite of these attractive properties, nonsmooth functions are
theoretically more difficult to optimize than smooth functions Nemirovski and Yudin (1983).
In this paper we focus on minimizing nonsmooth functions where the functions are either 
stochastic (stochastic optimization), or learning samples are provided incrementally (online 
learning). 

Smoothness and strong convexity are typically certificates of the existence of fast global
solvers. Nesterov's deterministic smoothing method Nesterov (2005b) deals with the difficulty
of nonsmooth functions by approximating them with smooth functions, for which
optimal methods Nesterov (2004) can be applied. It converges as $f(x_t) - \min_x f(x) \le O(1/t)$
after $t$ iterations. If a nonsmooth function is strongly convex, this rate can be improved to
$O(1/t^2)$ using the excessive gap technique Nesterov (2005a).

In this paper, we extend Nesterov's smoothing method to the stochastic setting by 
proposing a stochastic smoothing method for nonsmooth functions. Combining this with a 
stochastic version of the optimal gradient descent method, we introduce and analyze a new 
algorithm named Accelerated Nonsmooth Stochastic Gradient Descent (ANSGD), for a class 
of functions that include the popular ML methods of interest. 

To our knowledge ANSGD is the first stochastic first-order algorithm that can achieve
the optimal $O(1/t)$ rate for minimizing nonsmooth loss functions without Polyak's averaging
Polyak and Juditsky (1992). In comparison, the classic SGD converges in $O(\ln t/t)$ for
nonsmooth strongly convex functions Shalev-Shwartz et al. (2007), and is usually not robust
Nemirovski et al. (2009). Even with Polyak's averaging Bach and Moulines (2011); Xu
(2011), there are cases where SGD's convergence rate still cannot be faster than $O(\ln t/t)$
Shamir (2011). Numerical experiments on real-world datasets also indicate that ANSGD
converges much faster than these state-of-the-art algorithms.

1. A short version of this paper appears in International Conference of Machine Learning (ICML) 2012.

A perturbation-based smoothing method was recently proposed for stochastic nonsmooth
minimization Duchi et al. (2011). This work achieves iteration complexities similar to ours,
in a parallel computation scenario. In serial settings, ANSGD enjoys better and optimal
bounds.

In machine learning, many problems can be cast as minimizing a composition of a loss 
function and a regularization term. Before proceeding to the algorithm, we first describe a 
different setting of "composite minimizations" that we will pursue in this paper, along with 
our notations and assumptions. 

1.1 A Different "Composite Setting" 

In the classic black-box setting of first-order stochastic algorithms Nemirovski et al. (2009),
the structure of the objective function $\min_x \{ f(x) = \mathbb{E}_\xi f(x,\xi) : \xi \sim P \}$ is unknown. In
each iteration $t$, an algorithm can only access the first-order stochastic oracle and obtain a
subgradient $f'(x,\xi_t)$. The basic assumption is that $f'(x) = \mathbb{E}_\xi f'(x,\xi)$ for any $x$, where the
random vector $\xi$ is from a fixed distribution $P$.

The composite setting (also known as splitting Lions and Mercier (1979)) is an extension
of the black-box model. It was proposed to exploit the structure of objective functions.
Driven by applications of sparse signal reconstruction, it has gained significant interest
from different communities Daubechies et al. (2004); Beck and Teboulle (2009); Nesterov
(2007a). Stochastic variants have also been proposed recently Lan (2010); Lan and Ghadimi
(2011); Duchi and Singer (2009); Hu et al. (2009); Xiao (2010). A stochastic composite
function $\Phi(x) := f(x) + g(x)$ is the sum of a smooth stochastic convex function
$f(x) = \mathbb{E}_\xi f(x,\xi)$ and a nonsmooth (but simple and deterministic) function $g(\cdot)$. To minimize $\Phi$,
previous work constructs the following model iteratively:

$$\langle \nabla f(x_t,\xi_t), x - x_t \rangle + \frac{1}{\eta_t} D(x,x_t) + g(x), \qquad (1)$$

where $\nabla f(x_t,\xi_t)$ is a gradient, $D(\cdot,\cdot)$ is a proximal function (typically a Bregman divergence)
and $\eta_t$ is a stepsize.

A successful application of the composite idea typically relies on the assumption that
model (1) is easy to minimize. If $g(\cdot)$ is very simple, e.g. $\|x\|_1$ or the nuclear norm, it is
straightforward to obtain the minimum in analytic form. However, this assumption does
not hold for many other applications in machine learning, where many loss functions (not
the regularization term; here the nonsmooth $g(\cdot)$ becomes the nonsmooth loss function) are
nonsmooth, and do not enjoy separability properties Wright et al. (2009). This includes
important examples such as hinge loss, absolute loss, and $\epsilon$-insensitive loss.
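To make the "easy to minimize" case concrete: when $g(x) = \lambda\|x\|_1$ and $D(x, x_t) = \frac{1}{2}\|x - x_t\|^2$, the minimizer of model (1) is the well-known soft-thresholding step. The sketch below is only an illustration; the function names, the Euclidean choice of $D$ and the weight `lam` are assumptions made for this example, not notation from the paper.

```python
import numpy as np

def soft_threshold(z, tau):
    """Elementwise soft-thresholding: shrink each entry of z towards zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def composite_step_l1(x_t, grad, eta, lam):
    """Minimize <grad, x - x_t> + (1/eta) * 0.5 * ||x - x_t||^2 + lam * ||x||_1 over x.

    Assumes the Euclidean proximal function D(x, x_t) = 0.5 * ||x - x_t||^2.
    """
    return soft_threshold(x_t - eta * grad, eta * lam)

# Small worked instance: a gradient step followed by shrinkage with threshold eta * lam = 0.1.
x_next = composite_step_l1(
    np.array([1.0, -0.2, 0.05]), np.array([0.5, 0.0, 0.0]), eta=0.2, lam=0.5
)
```

No comparably simple closed form exists once the nonsmooth term is a data-dependent loss such as the hinge loss, which is the situation the new setting addresses.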

In this paper, we tackle this problem by studying a new stochastic composite setting:
$\min_x \Phi(x) = f(x) + g(x)$, where the loss function $f(\cdot)$ is convex and nonsmooth, while $g(\cdot)$ is
convex and $L_g$-Lipschitz smooth:

$$g(x) \le g(y) + \langle \nabla g(y), x - y \rangle + \frac{L_g}{2}\|x - y\|^2. \qquad (2)$$

For clarity, in this paper we focus on unconstrained minimizations. Without loss of generality,
we assume that both $f(\cdot)$ and $g(\cdot)$ are stochastic: $f(x) = \mathbb{E}_\xi f(x,\xi)$ and $g(x) = \mathbb{E}_\xi g(x,\xi)$,
where $\xi$ has distribution $P$. If either one is deterministic, its $\xi$ is then dropped. To make
our algorithm and analysis more general, we assume that $g(\cdot)$ is $\mu$-strongly convex: $\forall x, y$,

$$g(x) \ge g(y) + \langle \nabla g(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2. \qquad (3)$$

If it is not strongly convex, one can simply take $\mu = 0$.
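As a quick sanity check of (2) and (3), consider the squared-norm regularizer $g(x) = \frac{\lambda}{2}\|x\|^2$ used later for SVMs. The short calculation below is a worked example added for illustration (it is not part of the original derivation); it shows that this $g$ satisfies both inequalities with equality, so one may take $L_g = \mu = \lambda$.

```latex
\begin{aligned}
g(x) &= \tfrac{\lambda}{2}\|x\|^2, \qquad \nabla g(y) = \lambda y, \\
g(y) + \langle \nabla g(y), x - y\rangle + \tfrac{\lambda}{2}\|x - y\|^2
  &= \tfrac{\lambda}{2}\|y\|^2 + \lambda\langle y, x\rangle - \lambda\|y\|^2
     + \tfrac{\lambda}{2}\|x\|^2 - \lambda\langle x, y\rangle + \tfrac{\lambda}{2}\|y\|^2 \\
  &= \tfrac{\lambda}{2}\|x\|^2 = g(x).
\end{aligned}
```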

The main idea of our algorithm again stems from exploiting the structures of $f(\cdot)$ and
$g(\cdot)$. In Section 2 we propose to form a smooth stochastic approximation of $f(\cdot)$, such that
the optimal methods Nesterov (2004) can be applied to attain optimal convergence rates.
The convergence of our proposed algorithm is analyzed in Section 3, and a batch-to-online
conversion is also proposed. Two popular machine learning problems are chosen as our
examples in Section 4, and numerical evaluations are presented in Section 5. All proofs in
this paper are provided in the appendix.



2. Approach 

2.1 Stochastic Smoothing Method 

An important breakthrough in nonsmooth minimization was made by Nesterov in a series
of works Nesterov (2005b,a, 2007b). By exploiting function structures, Nesterov shows
that in many applications, minimizing a well-structured nonsmooth function $f(x)$ can be
formulated as an equivalent saddle-point form

$$\min_x f(x) = \min_x \max_{u \in \mathcal{U}} \left[ \langle Ax, u \rangle - Q(u) \right], \qquad (4)$$

where $u \in \mathbb{R}^m$, $\mathcal{U} \subseteq \mathbb{R}^m$ is a convex set, $A$ is a linear operator mapping $\mathbb{R}^D \to \mathbb{R}^m$ and
$Q(u)$ is a continuous convex function. Inserting a non-negative $\zeta$-strongly convex function
$\omega(u)$ in (4), one obtains a smooth approximation of the original nonsmooth function

$$f(x,\gamma) := \max_{u \in \mathcal{U}} \left[ \langle Ax, u \rangle - Q(u) - \gamma\omega(u) \right], \qquad (5)$$

where $\gamma > 0$ is a fixed smoothness parameter which is crucial in the convergence analysis.
The key property of this approximation is:

Lemma 1 (Nesterov (2005b), Theorem 1) Function $f(x,\gamma)$ is convex and continuously
differentiable, and its gradient is Lipschitz continuous with constant $L_\gamma := \frac{\|A\|^2}{\gamma\zeta}$, where

$$\|A\| := \max\{ \langle Ax, u \rangle : \|x\| = 1, \|u\| = 1 \}. \qquad (6)$$




Nesterov's smoothing method was originally proposed for deterministic optimization.
A major drawback of this method is that the number of iterations $N$ must be known
beforehand, such that the algorithm can set a proper smoothness parameter $\gamma = O(1/N)$
to ensure convergence. This makes it unsuitable for algorithms that run forever, or whose
number of iterations is not known. Following his work, we propose to extend this smoothing
method to stochastic optimization. Our stochastic smoothing differs from the deterministic
one in the operator $A$ and the smoothness parameter $\gamma$, both of which will be time-varying.

We assume that the nonsmooth part $f(x,\xi)$ of the stochastic composite function $\Phi(\cdot)$
is well structured, i.e. for a specific realization $\xi_t$, it has an equivalent form like the max
function in (4):



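$$f(x,\xi_t) = \max_{u \in \mathcal{U}} \left[ \langle A_{\xi_t} x, u \rangle - Q(u) \right], \qquad (7)$$

where $A_{\xi_t}$ is a stochastic linear operator associated with $\xi_t$. We construct a smooth
approximation of this function as:

$$f(x,\xi_t,\gamma_t) := \max_{u \in \mathcal{U}} \left[ \langle A_{\xi_t} x, u \rangle - Q(u) - \gamma_t\omega(u) \right], \qquad (8)$$

where $\gamma_t$ is a time-varying smoothness parameter only associated with iteration index $t$, and
is independent of $\xi_t$. Function $\omega(\cdot)$ is non-negative and $\zeta$-strongly convex. Due to Lemma 1,
$f(x,\xi_t,\gamma_t)$ is $\frac{\|A_{\xi_t}\|^2}{\gamma_t\zeta}$-Lipschitz smooth. It follows that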



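Lemma 2 $\forall x, y, t$,

$$\mathbb{E}_\xi f(x,\xi,\gamma_t) \le \mathbb{E}_\xi f(y,\xi,\gamma_t) + \mathbb{E}_\xi \langle \nabla f(y,\xi,\gamma_t), x - y \rangle + \frac{\mathbb{E}_\xi\|A_\xi\|^2}{2\gamma_t\zeta}\|x - y\|^2.$$

We have the following observation about our composite objective $\Phi(\cdot)$, which relates the
reduction of the original and approximated function values.

Lemma 3 For any $x, x_t, t$,

$$\Phi(x_t) - \Phi(x) \le \mathbb{E}_\xi\left[ f(x_t,\xi,\gamma_t) + g(x_t,\xi) \right] - \mathbb{E}_\xi\left[ f(x,\xi,\gamma_t) + g(x,\xi) \right] + \gamma_t D_U, \qquad (9)$$

where $D_U := \max_{u \in \mathcal{U}} \omega(u)$.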



2.2 Accelerated Nonsmooth SGD (ANSGD) 

We are now ready to present our algorithm ANSGD (Algorithm 1). This stochastic algorithm
is obtained by applying Nesterov's optimal method to our smooth surrogate function, and
thus has a similar form to that of his original deterministic method Nesterov (2004) (p. 78).
However, our convergence analysis is more straightforward, and does not rely on the concept
of estimate sequences. Hence it is easier to identify proper series $\gamma_t$, $\eta_t$, $\alpha_t$ and $\theta_t$ that are
crucial in achieving fast rates of convergence. These series will be determined in our main
results (Thm. 6 and 7).



3. Convergence Analysis 

To clarify our presentation, we use Table 1 to list some notations that will be used throughout
the paper.
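Algorithm 1 Accelerated Nonsmooth Stochastic Gradient Descent (ANSGD)

INPUT: series $\gamma_t, \eta_t, \theta_t > 0$ and $0 < \alpha_t < 1$;
OUTPUT: $x_{t+1}$;
[0.] Initialize $x_0$ and $v_0$;
for $t = 0, 1, 2, \ldots$ do
  [1.] $y_t \leftarrow \dfrac{(1-\alpha_t)(\mu+\theta_t)x_t + \alpha_t\theta_t v_t}{\mu(1-\alpha_t) + \theta_t}$
  [2.] $f_{t+1}(x) \leftarrow \max_{u \in \mathcal{U}} \left[ \langle A_{\xi_{t+1}} x, u \rangle - Q(u) - \gamma_{t+1}\omega(u) \right]$
  [3.] $x_{t+1} \leftarrow y_t - \eta_t \left[ \nabla f_{t+1}(y_t) + \nabla g_{t+1}(y_t) \right]$
  [4.] $v_{t+1} \leftarrow \dfrac{\theta_t v_t + \mu y_t - \left[ \nabla f_{t+1}(y_t) + \nabla g_{t+1}(y_t) \right]}{\mu + \theta_t}$
end for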
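The four lines of Algorithm 1 can be sketched in a few lines of NumPy for an $\ell_2$-regularized hinge-loss SVM. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter schedule (`alpha`, `gamma_next`, `theta`, `eta`), the constant `omega_const`, the zero initialization and uniform sampling with replacement are all choices made here for the example, and the function names are hypothetical.

```python
import numpy as np

def smoothed_hinge_grad(x, s, label, gamma):
    """Gradient in x of max_{0<=u<=1} [u * (1 - label * s^T x) - gamma * u^2 / 2]."""
    z = label * (s @ x)
    u_star = np.clip((1.0 - z) / gamma, 0.0, 1.0)  # maximizer of the inner problem
    return -u_star * label * s

def ansgd_hinge_svm(S, labels, lam, n_iters, omega_const=1.0, seed=0):
    """Sketch of Algorithm 1 with g(x) = lam/2 * ||x||^2, so mu = L_g = lam.

    The schedule below is an illustrative assumption, not the tuned series
    of the theorems in the text.
    """
    rng = np.random.default_rng(seed)
    n, d = S.shape
    mu = L_g = lam
    A2 = np.mean(np.sum(S ** 2, axis=1))      # sample estimate of E||A_xi||^2
    x = np.zeros(d)
    v = np.zeros(d)
    for t in range(n_iters):
        alpha = 2.0 / (t + 3.0)               # keeps 0 < alpha_t < 1
        gamma_next = alpha                    # plays the role of gamma_{t+1}
        theta = L_g * alpha + A2 + omega_const / np.sqrt(alpha)
        eta = alpha / (mu + theta)
        # Line 1: combine x_t and v_t into the query point y_t.
        y = ((1 - alpha) * (mu + theta) * x + alpha * theta * v) / (mu * (1 - alpha) + theta)
        # Line 2: draw a sample; the smoothed loss enters only through its gradient.
        i = rng.integers(n)
        grad = smoothed_hinge_grad(y, S[i], labels[i], gamma_next) + lam * y
        # Line 3: gradient step from y_t.
        x = y - eta * grad
        # Line 4: update the auxiliary sequence v_t.
        v = (theta * v + mu * y - grad) / (mu + theta)
    return x

def svm_objective(x, S, labels, lam):
    """Exact (unsmoothed) regularized hinge objective on the given samples."""
    return np.mean(np.maximum(0.0, 1.0 - labels * (S @ x))) + 0.5 * lam * (x @ x)

# Synthetic, linearly separable data used only to exercise the sketch.
data_rng = np.random.default_rng(1)
S = data_rng.normal(size=(400, 5))
w_true = data_rng.normal(size=5)
labels = np.sign(S @ w_true)
x_hat = ansgd_hinge_svm(S, labels, lam=1e-2, n_iters=3000)
train_acc = np.mean(np.sign(S @ x_hat) == labels)
```

Note that Line 2 never needs the smoothed loss value itself: for the hinge loss the inner maximizer has a closed form, so only its gradient at $y_t$ is computed.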



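Table 1: Some notations.

Symbol                              Meaning
$f_t(x)$, $g_t(x)$                  $f(x,\xi_t,\gamma_t)$, $g(x,\xi_t)$
$\nabla f_t(x)$, $\nabla g_t(x)$    $\nabla f(x,\xi_t,\gamma_t)$, $\nabla g(x,\xi_t)$
$\sigma_t(x)$                       $[\nabla f_t(x) + \nabla g_t(x)] - \mathbb{E}_{\xi_t}[\nabla f_t(x) + \nabla g_t(x)]$
$\sigma^2$                          $\max_t \mathbb{E}\|\sigma_{t+1}(y_t)\|^2$
$\Delta_t$                          $\mathbb{E}_{\xi_t}[f_t(x_t) + g_t(x_t)] - \mathbb{E}_{\xi_t}[f_t(x) + g_t(x)]$
$\Gamma_{t+1}$                      $\langle \sigma_{t+1}(y_t), \alpha_t x + (1-\alpha_t)x_t - y_t \rangle$
$D_t^2$                             $\frac{1}{2}\mathbb{E}\|x - v_t\|^2$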



Our convergence rates are based on the following main lemma, which bounds the progressive
reduction $\Delta_t$ of the smoothed function value. Lines 1, 3, and 4 of Alg. 1
are in fact derived from the proof of this lemma.

Lemma 4 Let $\gamma_t$ be monotonically decreasing. Applying algorithm ANSGD to the nonsmooth
composite function $\Phi(\cdot)$, we have $\forall x$ and $\forall t \ge 0$,

$$\begin{aligned}
\Delta_{t+1} \le{}& (1-\alpha_t)\Delta_t + (1-\alpha_t)(\gamma_t - \gamma_{t+1})D_U + \Gamma_{t+1} \\
&+ \frac{\alpha_t\theta_t}{2}\|x - v_t\|^2 - \frac{\alpha_t(\mu+\theta_t)}{2}\|x - v_{t+1}\|^2 \\
&+ \eta_t p q + \left( \frac{\alpha_t}{2(\mu+\theta_t)} + \frac{L_{t+1}\eta_t^2}{2} - \eta_t \right) q^2,
\end{aligned} \qquad (10)$$

where $p := \|\sigma_{t+1}(y_t)\|$ and $q := \|\nabla f_{t+1}(y_t) + \nabla g_{t+1}(y_t)\|$.



3.1 How to Choose Stepsizes $\eta_t$

In the RHS of (10), the non-negative scalars $p, q \ge 0$ are data-dependent, and could be arbitrarily
large. Hence we need to set proper stepsizes $\eta_t$ such that the last two terms in (10) are
non-positive. One might conjecture that there exists a series $c_t \ge 0$ such that

$$\eta_t p q + \left( \frac{\alpha_t}{2(\mu+\theta_t)} + \frac{L_{t+1}\eta_t^2}{2} - \eta_t \right) q^2 \le c_t p^2. \qquad (11)$$

It is easy to verify that if we take $\eta_t = \frac{\alpha_t}{\mu+\theta_t}$ and any series $c_t \ge \frac{\alpha_t}{2(\mu+\theta_t-\alpha_t L_{t+1})} \ge 0$, then
(11) is satisfied. To retain a tight bound, we take

$$c_t = \frac{\alpha_t}{2(\mu+\theta_t-\alpha_t L_{t+1})}. \qquad (12)$$

Taking expectation on both sides of (10) and noticing that $\mathbb{E}_{\xi_{t+1}|\xi_{[t]}}\Gamma_{t+1} = 0$ and
$\mathbb{E}_{\xi_{t+1}} c_t \le \frac{\alpha_t}{2(\mu+\theta_t-\alpha_t \mathbb{E}_\xi L_{t+1})}$ due to Jensen's inequality, we have

Lemma 5 $\forall x$ and $\forall t \ge 0$,

$$\mathbb{E}\Delta_{t+1} \le (1-\alpha_t)\mathbb{E}\Delta_t + \alpha_t\theta_t D_t^2 - \alpha_t(\mu+\theta_t)D_{t+1}^2 + \frac{\alpha_t\sigma^2}{2(\mu+\theta_t-\alpha_t\mathbb{E}L_{t+1})} + (1-\alpha_t)(\gamma_t-\gamma_{t+1})D_U. \qquad (13)$$

The optimal convergence rates of our algorithm differ according to whether $\mu$ is positive
or not. They are presented separately in the following two subsections, where the choices
of $\gamma_t$, $\theta_t$, $\alpha_t$ will also be determined.

3.2 Optimal Rates for Composite Minimizations when $\mu = 0$

When $\mu = 0$, $g(\cdot)$ is only convex and $L_g$-Lipschitz smooth, but not assumed to be strongly
convex.

Theorem 6 Take a t = j^, jt+i = ®t, &t = L g a t + -j= + and r\ t = in AlgUl 

where Q is a constant. We have Vx and Vi > ; 

■ [^0 - .(,)] < + + W^/"> , (14) 

where D 2 : = maxj Z)|. 

In this result, the variance bound is optimal up to a constant factor Agarwal et al. (2012).



The dominating factor is still due to the stochasticity, but not affected by the nonsmoothness 
of /(). Taking the parameter £1 = a/D, this last term becomes 2 -^=2-. This bound is better 
than that of stochastic gradient descent or stochastic dual averaging foekel et al.1 iboid ) for 



{ LD 2 D 2 +a 2 \ 

minimizing L-Lipschitz smooth functions, whose rate is O ( — H — ^= — J; without the 
smooth function g(), our bound is of the same order as it, keeping in mind that our rate 
is for nonsmooth minimizations. This fact underscores the potential of using stochastic 
optimal methods for nonsmooth functions. 

The diminishing smoothness parameter $\gamma_{t+1} = \alpha_t$ indicates that initially a smoother
approximation is preferred, such that the solution does not change wildly due to the nonsmoothness
and stochasticity. Eventually the approximated function should be closer and
closer to the original nonsmooth function, such that the optimality can be reached. Some
concrete examples are given in Fig. 1.

The $\mathbb{E}\|A_\xi\|^2$ in our bound is a theoretical constant. In Sec. 4 we demonstrate a sampling
method, and it turns out to work quite well in estimating $\mathbb{E}\|A_\xi\|^2$.
3.3 Nearly Optimal Rates for Strongly Convex Minimizations

When $\mu > 0$, $g(\cdot)$ is strongly convex, and the convergence rate of ANSGD can be improved
to $O(1/t)$.



Theorem 7 Take a t = j|p jt+i = «t, &t = L g a t + ^ + /i and Vt = -p^; in 

AlgU\ Denote 

C:=max<( — l ^ jL ,2( -f ) }> . (15) 

W^e /icwe Vx and Vt > 0, 



2 



Cm V i* 



m r«/ x , * -i 6.58L Q D 2 „ 4D U a 2 , x 

'^-•W^TM^w^ (16) 



where 



B := 



2E||A^H 2 D 2 /C 

,2n. " " ' ' ' - (17) 



/+l '/"-/• c, 



2(C-2)E||A S || 2 D 2 /C 



and £> 2 := max 0<i<min|4 c} D 2 . 



Note that C is the smallest iteration index for which one can retain 1/t 2 rates for the E||yl^|| 2 
part (B). Without any knowledge about L g , \i and E||j4^|| 2 , one can set a parameter and 

take 6t = L g at + ^ + — // in the algorithm. In our experiments, we observe that one 

can take fi fairly large (of 0(E||Ag|| 2 )), meaning that C can be very small (0(1)), and B is 
O(^) for all t. In this sense, strongly convex ANSGD is almost parameter- free. Without 
the 0(l/t) rate of Dy, all terms in our bound are optimal. This is why our rate is called 
"nearly" optimal. In practice, Du is usually small, and it will be dominated by the last 

2 

term ,,, ^ . 

3.4 Batch-to-Online Conversion 

The performance of an online learning (online convex minimization) algorithm is typically
measured by regret, which can be expressed as

$$R(t) := \sum_{i=0}^{t-1} \left[ \Phi(x_i,\xi_{i+1}) - \Phi(x_t^*,\xi_{i+1}) \right], \qquad (18)$$

where $x_t^* := \arg\min_x \sum_{i=0}^{t-1} \Phi(x,\xi_{i+1})$. In the learning theory literature, many approaches
have been proposed which use online learning algorithms for batch learning (stochastic optimization),
called "online-to-batch" (O-to-B) conversions. For convex functions, many of these
approaches employ an "averaged" solution as the final solution.

On the contrary, we show that stochastic optimization algorithms can also be used
directly for online learning. This "batch-to-online" (B-to-O) conversion is almost free of any
additional effort: under i.i.d. assumptions of data, one can use any stochastic optimization
algorithm for online learning.
Proposition 8 For any $t \ge 0$,

$$\mathbb{E}_{\xi_{[t]}} R(t) \le \sum_{i=0}^{t-1} \mathbb{E}_{\xi_{[i]}} \left[ \Phi(x_i) - \Phi(x^*) \right] + \mathbb{E}_{\xi_{[t]}} \sum_{i=0}^{t-1} \left[ \Phi(x^*) - \Phi(x_t^*,\xi_{i+1}) \right], \qquad (19)$$

where $x^* := \arg\min_x \Phi(x)$ and $x_t^* := \arg\min_x \sum_{i=0}^{t-1} \Phi(x,\xi_{i+1})$.

When $\Phi(\cdot)$ is convex, the second term in (19) can be bounded by applying standard results
in uniform convergence (e.g. Boucheron et al. (2005)): $\sum_{i=0}^{t-1} \left[ \Phi(x^*) - \Phi(x_t^*,\xi_{i+1}) \right] = O(\sqrt{t})$.
Together with summing up the RHS of (14), we can obtain an $O(\sqrt{t})$ regret bound. When
$\Phi(\cdot)$ is strongly convex, the second term in (19) can be bounded using Shalev-Shwartz et al.
(2009): $\sum_{i=0}^{t-1} \left[ \Phi(x^*) - \Phi(x_t^*,\xi_{i+1}) \right] = O(\ln t)$. Together with summing up the RHS of (16),
an $O(\ln t)$ regret bound is achieved. The $O(\sqrt{t})$ and $O(\ln t)$ regret bounds are known to be optimal.

Using our proposed ANSGD for online learning by B-to-O achieves the same (optimal)
regret bounds as state-of-the-art algorithms designed for online learning. However, using
O-to-B, one can only retain an $O(\ln t/t)$ rate of convergence for stochastic strongly convex
optimization. From this perspective, O-to-B is inferior to B-to-O. The sub-optimality of
O-to-B is also discussed in Hazan and Kale (2011).



4. Examples 

In this section, two nonsmooth functions are given as examples. We will show how these 
functions can be stochastically approximated, and how to calculate parameters used in our 
algorithm. 

4.1 Hinge Loss SVM Classification 

Hinge loss is a convex surrogate of the 0-1 loss. Denote a sample-label pair as
$\xi := \{s, l\} \sim P$, where $s \in \mathbb{R}^D$ and $l \in \mathbb{R}$. Hinge loss can be expressed as
$f_{\text{hinge}}(x) := \max\{0, 1 - l s^T x\}$. It has been widely used for SVM classifiers where the objective is
$\min \Phi(x) = \min \mathbb{E}_\xi f_{\text{hinge}}(x) + \frac{\lambda}{2}\|x\|^2$. Note that the regularization term $g(x) = \frac{\lambda}{2}\|x\|^2$ is $\lambda$-strongly
convex, hence according to Thm. 7 ANSGD enjoys $O(1/(\lambda t))$ rates. Taking $\omega(u) = \frac{1}{2}\|u\|^2$
in (8), it is easy to check that the smooth stochastic approximation of hinge loss is

$$f_{\text{hinge}}(x,\xi_t,\gamma_t) = \max_{0 \le u \le 1} \left\{ u\left(1 - l_t s_t^T x\right) - \gamma_t\frac{u^2}{2} \right\}. \qquad (20)$$

This maximization is simple enough such that we can obtain an equivalent smooth representation:

$$f_{\text{hinge}}(x,\xi_t,\gamma_t) = \begin{cases}
0 & \text{if } l_t s_t^T x \ge 1, \\
\dfrac{(1 - l_t s_t^T x)^2}{2\gamma_t} & \text{if } 1 - \gamma_t \le l_t s_t^T x < 1, \\
1 - l_t s_t^T x - \dfrac{\gamma_t}{2} & \text{if } l_t s_t^T x < 1 - \gamma_t.
\end{cases} \qquad (21)$$

Several examples of $f_{\text{hinge}}(x,\xi_t,\gamma_t)$ with varying $\gamma_t$ are plotted in Fig. 1 (left), in comparison with the
hinge loss.
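The equivalence between the maximization (20) and the closed form (21) can be checked numerically. The sketch below, written in terms of the margin $z = l_t s_t^T x$, compares the closed form with a brute-force maximization over a grid of $u$ values; the function names and the grid resolution are assumptions made for this illustration.

```python
import numpy as np

def smoothed_hinge(z, gamma):
    """Closed form (21) as a function of the margin z = l * s^T x."""
    z = np.asarray(z, dtype=float)
    quadratic = (1.0 - z) ** 2 / (2.0 * gamma)
    linear = 1.0 - z - gamma / 2.0
    return np.where(z >= 1.0, 0.0, np.where(z >= 1.0 - gamma, quadratic, linear))

def smoothed_hinge_by_max(z, gamma, grid_size=20001):
    """Definition (20): brute-force maximization over a grid of u in [0, 1]."""
    u = np.linspace(0.0, 1.0, grid_size)
    return np.max(u * (1.0 - z) - gamma * u ** 2 / 2.0)

margins = np.linspace(-2.0, 3.0, 11)
gamma = 0.5
closed = smoothed_hinge(margins, gamma)
brute = np.array([smoothed_hinge_by_max(z, gamma) for z in margins])
hinge = np.maximum(0.0, 1.0 - margins)
```

With $\omega(u) = u^2/2$ on $[0, 1]$ we have $D_U = 1/2$, so the smoothed loss stays within $\gamma_t/2$ below the exact hinge loss, which matches the gap term in Lemma 3.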

Figure 1: Left: Hinge loss and its smooth approximations. Right: Absolute loss and its
smooth approximations.

Here $u$ is a scalar, hence it is straightforward to calculate $\mathbb{E}\|A_\xi\|^2$, which will be used to
generate the sequence $\theta_t$. In binary classification, suppose $l \in \{1, -1\}$. Using definition (6),
one only needs to calculate $\mathbb{E}(\max_{\|x\|=1} s^T x)^2$. Practically one can take a small subset of $k$
random samples $s_i$ (e.g. $k = 100$), and calculate the sample average of the squared norms
$\frac{1}{k}\sum_{i=1}^{k} \|s_i\|^2$. This yields $\frac{1}{k}\sum_{i=1}^{k} (\max_{\|x\|=1} s_i^T x)^2$, an estimate of $\mathbb{E}\|A_\xi\|^2$.
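Since $\max_{\|x\|=1} s^T x = \|s\|$, the estimate described above is just the mean squared norm of a random subset of samples. A minimal sketch follows; the function name, the default `k` and the synthetic data are assumptions for this example.

```python
import numpy as np

def estimate_expected_sq_norm(S, k=100, seed=0):
    """Average squared Euclidean norm of k rows of S drawn without replacement."""
    rng = np.random.default_rng(seed)
    k = min(k, S.shape[0])
    idx = rng.choice(S.shape[0], size=k, replace=False)
    return float(np.mean(np.sum(S[idx] ** 2, axis=1)))

# Synthetic check: rows are standard normal in 20 dimensions, so E||s||^2 = 20.
samples = np.random.default_rng(0).normal(size=(5000, 20))
estimate = estimate_expected_sq_norm(samples, k=100)

# Tiny deterministic check: squared norms are 25 and 0, so the average is 12.5.
tiny = estimate_expected_sq_norm(np.array([[3.0, 4.0], [0.0, 0.0]]), k=2)
```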



4.2 Absolute Loss Robust Regression 



Absolute loss is an alternative to the popular squared loss for robust regressions Hastie et al.
(2009). Using the same notation as in Sec. 4.1, it can be expressed as $f_{\text{abs}}(x) := |l - s^T x|$. Taking
$\omega(u) = \frac{1}{2}\|u\|^2$ in (8), its smooth stochastic approximation can be expressed as

$$f_{\text{abs}}(x,\xi_t,\gamma_t) = \max_{-1 \le u \le 1} \left\{ u\left(l_t - s_t^T x\right) - \gamma_t\frac{u^2}{2} \right\}. \qquad (22)$$

Solving this maximization w.r.t. $u$ we obtain an equivalent form:

$$f_{\text{abs}}(x,\xi_t,\gamma_t) = \begin{cases}
l_t - s_t^T x - \dfrac{\gamma_t}{2} & \text{if } l_t - s_t^T x \ge \gamma_t, \\
\dfrac{(l_t - s_t^T x)^2}{2\gamma_t} & \text{if } -\gamma_t \le l_t - s_t^T x < \gamma_t, \\
-(l_t - s_t^T x) - \dfrac{\gamma_t}{2} & \text{if } l_t - s_t^T x < -\gamma_t.
\end{cases} \qquad (23)$$

This approximation looks similar to the well-studied Huber loss Huber (1964), though they
are different. Actually they share the same form only when $\gamma_t = 0.5$ (green curve in Fig. 1,
right).
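The relation to the Huber loss can be illustrated numerically. The sketch below evaluates the closed form (23) as a function of the residual $r = l_t - s_t^T x$ and compares it with one common parameterization of the Huber loss ($r^2$ inside $[-\delta, \delta]$ and $2\delta|r| - \delta^2$ outside). Which Huber parameterization the comparison in the text refers to is an assumption here, as are the function names.

```python
import numpy as np

def smoothed_abs(r, gamma):
    """Closed form (23) as a function of the residual r = l - s^T x."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= gamma, r ** 2 / (2.0 * gamma), np.abs(r) - gamma / 2.0)

def huber_squared_form(r, delta):
    """Assumed Huber parameterization: r^2 inside [-delta, delta], linear outside."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta, r ** 2, 2.0 * delta * np.abs(r) - delta ** 2)

residuals = np.linspace(-3.0, 3.0, 13)
at_half = smoothed_abs(residuals, 0.5)
huber_half = huber_squared_form(residuals, 0.5)
at_quarter = smoothed_abs(residuals, 0.25)
huber_quarter = huber_squared_form(residuals, 0.25)
```

Under this parameterization the two losses coincide at $\gamma_t = \delta = 0.5$ and differ for other values, for instance at $0.25$.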

The parameter $\mathbb{E}\|A_\xi\|^2$ can be estimated in a similar way as discussed in Sec. 4.1.
5. Experimental Results

In this section, five publicly available datasets from various application domains will be
used to evaluate the efficiency of ANSGD. Datasets "svmguide1", "real-sim", "rcv1" and
"alpha" are for binary classification, and "abalone" is for robust regression (see footnote 1).

Following our examples in Sec. 4, we will evaluate our algorithm using approximated
hinge loss for classification, and approximated absolute loss for regression. Exact hinge
and absolute losses will be used for the subgradient descent algorithms that we will compare
with, as described in the following section. All losses are squared-$\ell_2$-norm-regularized. The
regularization parameter $\lambda$ is shown on each figure. When assuming strong convexity, we
take $\mu = \lambda$.



5.1 Algorithms for Comparison and Parameters 



We compare ANSGD with three state-of-the-art algorithms. Each algorithm has a data-dependent
tuning parameter, denoted by $\Omega$ (although they have different physical meanings).
The best values of $\Omega$ are found based on a tuning subset of samples. Note that when
assuming strong convexity, our ANSGD is almost parameter-free. As discussed after Thm. 7,
our experiments indicate that the optimal Q is taken such that ~ 1, meaning that 

one can simply take 9t = L g at + + 1 — H- 



SGD. The classic stochastic approximation Robbins and Monro (1951) is adopted:
$x_{t+1} \leftarrow x_t - \eta_t f'(x_t)$, where $f'(x_t)$ is the subgradient. When only assuming convexity ($\mu = 0$), we
use stepsize r\ t 

Bottoul : rjt 



When assuming strong-convexity, we follow the stepsize used in SGD2 



Averaged SGD. This is algorithmically the same as SGD, except that the averaged result
$\bar{x} := \frac{1}{t}\sum_{i=1}^{t} x_i$ is used for testing. We follow the stepsizes suggested by the recent
work on the non-asymptotic analysis of SGD Bach and Moulines (2011); Xu (2011), where
it is argued that Polyak's averaging combined with proper stepsizes yields optimal rates.
When only assuming convexity, we use stepsizes r\t = M= L iBach and Moulines! ( 201ll ). When 

l 



Xiil (120 iih 



assuming strong convexity, the stepsize is taken as r/t = Q^i +flt /n)^/ i 

AC-SA. This approach Lan (2010); Lan and Ghadimi (2011) is interesting to compare because,
like ANSGD, it is another way of obtaining a stochastic algorithm based on Nesterov's
optimal method, raising the question of whether it has similar behavior. Theoretically, according
to Prop. 8 and 9 in Lan and Ghadimi (2011), the bound for the nonsmooth part is
of $O(1/\sqrt{t})$ for $\mu = 0$ and $O(1/t)$ for $\mu > 0$. In comparison, our nonsmooth part converges
in $O(1/t)$ for $\mu = 0$ and $O(1/t^2)$ for $\mu > 0$. Numerically we observe that directly applying
AC-SA to nonsmooth functions results in inferior performance.



1. Dataset "alpha" is obtained from ftp://largescale.ml.tu-berlin.de/largescale/, and the other
four datasets can be accessed via http://www.csie.ntu.edu.tw/~cjlin/libsvmtools. Dataset "rcv1"
comes with 20,242 training samples and 677,399 testing samples. For "svmguide1" and "real-sim", we
randomly take 60% of the samples for training and 40% for testing. For "alpha" and "abalone", 80%
are used for training, and the remaining 20% are used for testing.
5.2 Results

Due to the stochasticity of all the algorithms, for each setting of the experiments, we run
the program 10 times, and plot the mean and standard deviation of the results using
error bars.

In the first set of experiments, we compare ANSGD with two subgradient-based algorithms,
SGD and Averaged SGD. Classification results are shown in Figs. 2, 3, 4 and 5, and
regression results are shown in Fig. 6. In each figure, the left column is for algorithms
without strongly convex assumptions, while in the right column the algorithms assume
strong convexity and take $\mu = \lambda$. For classification results, we plot function values over the
testing set in the first row, and plot testing accuracies in the second row.
Figure 2: Classification with "svmguide1".



It is clear that in all these experiments, ANSGD's function values converge consistently
faster than those of the other two SGD algorithms. In non-strongly convex experiments, it converges
significantly faster than SGD and its averaged version. In strongly convex experiments, it
still outperforms, and is more robust than, strongly convex SGD. Averaged SGD performs
well in strongly convex settings in terms of prediction accuracies, although its errors are
still higher than ANSGD's in the first three datasets. The only exception is "alpha"
(Fig. 5), where Averaged SGD retains higher function values than ANSGD, but its accuracies
are, contradictorily, higher in early stages. The reason might be that the inexact solution
serves as an additional regularization factor, which cannot be predicted by the analysis of
convergence rates.

In the second set of experiments, we compare ANSGD with AC-SA and its strongly
convex version. Results are in Figs. 7, 8, 9 and 10. In all experiments our ANSGD significantly
outperforms AC-SA, and is much more stable. These experiments confirm the theoretically
better rates discussed in Sec. 5.1.
Figure 3: Classification with "real-sim".




6. Conclusions and Future Work 

We introduce a different composite setting for nonsmooth functions. Under this setting we
propose a stochastic smoothing method and a novel stochastic algorithm ANSGD. Convergence
analysis shows that it achieves (nearly) optimal rates under both convex and strongly
convex assumptions. We also propose a "Batch-to-Online" conversion for online learning,
and show that optimal regrets can be obtained.

We will extend our method to constrained minimizations, as well as cases when the
approximated function $f(\cdot)$ is not easily obtained by maximizing over $u$. Nesterov's excessive
gap technique has the "true" optimal $1/t^2$ bound, and we will investigate the possibility of
integrating it in our algorithm. Exploiting links with statistical learning theories may also
be promising.

Figure 5: Classification with "alpha".
Figure 6: Regression with "abalone".
Figure 7: Classification with "svmguide1".
Figure 10: Classification with "alpha".
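Appendix A. Proof of Lemma 3

Proof

$$\begin{aligned}
\Phi(x_t) - \Phi(x)
&= [f(x_t) - f(x)] + [g(x_t) - g(x)] \\
&= \mathbb{E}_\xi[f(x_t,\xi)] + \mathbb{E}_\xi[-f(x,\xi) + g(x_t,\xi) - g(x,\xi)] \\
&= \mathbb{E}_\xi \max_{u \in \mathcal{U}} \left\{ [\langle A_\xi x_t, u \rangle - Q(u) - \gamma_t\omega(u)] + \gamma_t\omega(u) \right\} + \mathbb{E}_\xi[-f(x,\xi) + g(x_t,\xi) - g(x,\xi)] \\
&\le \mathbb{E}_\xi \max_{u \in \mathcal{U}} [\langle A_\xi x_t, u \rangle - Q(u) - \gamma_t\omega(u)] + \max_{u \in \mathcal{U}} [\gamma_t\omega(u)] + \mathbb{E}_\xi[-f(x,\xi) + g(x_t,\xi) - g(x,\xi)] \\
&= \mathbb{E}_\xi[f(x_t,\xi,\gamma_t)] + \gamma_t D_U + \mathbb{E}_\xi[-f(x,\xi) + g(x_t,\xi) - g(x,\xi)] \\
&\le \mathbb{E}_\xi[f(x_t,\xi,\gamma_t) - f(x,\xi,\gamma_t)] + \mathbb{E}_\xi[g(x_t,\xi) - g(x,\xi)] + \gamma_t D_U.
\end{aligned} \qquad (24)$$

The last inequality is due to the non-negativity of $\omega(\cdot)$ and the definitions of $f(x,\xi)$ (7) and $f(x,\xi,\gamma_t)$ (8).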



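Appendix B. Proof of Lemma 4

Before proceeding to the proof of this lemma, we present two auxiliary results. For clarity,
in the following lemmas and proofs we use the following notations to denote the smoothly
approximated composite function and its expectation:

$$F_t(x,\gamma_t) := f_t(x) + g_t(x) = f(x,\xi_t,\gamma_t) + g(x,\xi_t) \qquad (25)$$

and

$$F(x,\gamma_t) := \mathbb{E}_{\xi_t} F_t(x,\gamma_t). \qquad (26)$$

The first lemma is on the smoothly approximated function and the smoothness parameter
$\gamma_t$.

Lemma 9 If $\gamma_t$ is monotonically decreasing with $t$, for any $x$ and $t \ge 0$,

$$F(x,\gamma_t) \le F(x,\gamma_{t+1}) \le F(x,\gamma_t) + (\gamma_t - \gamma_{t+1})D_U, \qquad (27)$$

where $D_U := \max_{u \in \mathcal{U}} \omega(u)$.

Proof The left inequality is obvious, since $\gamma_t \ge \gamma_{t+1}$ and $\omega(u)$ is nonnegative. For the
right inequality,

$$\begin{aligned}
F(x,\gamma_{t+1}) - F(x,\gamma_t) &= \mathbb{E}_\xi f(x,\xi,\gamma_{t+1}) - \mathbb{E}_\xi f(x,\xi,\gamma_t) \\
&= \max_{u \in \mathcal{U}} [\langle \mathbb{E}_\xi A_\xi x, u \rangle - Q(u) - \gamma_{t+1}\omega(u)] - \max_{u \in \mathcal{U}} [\langle \mathbb{E}_\xi A_\xi x, u \rangle - Q(u) - \gamma_t\omega(u)] \\
&\le \max_{u \in \mathcal{U}} \left\{ [\langle \mathbb{E}_\xi A_\xi x, u \rangle - Q(u) - \gamma_{t+1}\omega(u)] - [\langle \mathbb{E}_\xi A_\xi x, u \rangle - Q(u) - \gamma_t\omega(u)] \right\} \\
&= \max_{u \in \mathcal{U}} [(\gamma_t - \gamma_{t+1})\omega(u)].
\end{aligned} \qquad (28)$$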
The second lemma is about proximal methods using Bregman divergence as prox-functions,
which is a direct result of optimality conditions. It appeared in Lan and Ghadimi
(2011) (Lemma 2), and is an extension of the "3-point identity" Chen and Teboulle (1993) (Lemma
3.1).

Lemma 10 (Lan and Ghadimi (2011)) Let $l(x)$ be a convex function. Let scalars $s_1, s_2 \ge 0$.
For any vectors $u$ and $v$, denote their Bregman divergence as $D(u,v)$. If $\forall x, u, v$,

$$x^* = \arg\min_x \; l(x) + s_1 D(u,x) + s_2 D(v,x), \qquad (29)$$

then

$$l(x) + s_1 D(u,x) + s_2 D(v,x) \ge l(x^*) + s_1 D(u,x^*) + s_2 D(v,x^*) + (s_1 + s_2) D(x^*,x). \qquad (30)$$

We are now ready to prove Lemma 4.
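Proof [Proof of Lemma 4] Due to Lemma 2 and Lipschitz-smoothness of $g(x)$, $F(x,\gamma_{t+1})$
has a Lipschitz smooth constant $L_{F_{t+1}} := \frac{\mathbb{E}_\xi\|A_\xi\|^2}{\gamma_{t+1}\zeta} + L_g$. It follows that

$$\begin{aligned}
F(x_{t+1},\gamma_{t+1})
&\le F(y_t,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&= (1-\alpha_t)F(y_t,\gamma_{t+1}) + \alpha_t F(y_t,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&= (1-\alpha_t)F(y_t,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), (1-\alpha_t)(x_t - y_t) \rangle \\
&\quad + \alpha_t F(y_t,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&\le (1-\alpha_t)F(x_t,\gamma_{t+1}) + \alpha_t F(y_t,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle \\
&\quad + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2,
\end{aligned} \qquad (31)$$

where the last inequality is due to the convexity of $F(\cdot)$. Subtracting $F(x,\gamma_{t+1})$ from both
sides of the above inequality we have:

$$\begin{aligned}
F(x_{t+1},\gamma_{t+1}) - F(x,\gamma_{t+1})
&\le (1-\alpha_t)F(x_t,\gamma_{t+1}) - F(x,\gamma_{t+1}) \\
&\quad + \alpha_t F(y_t,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&\le (1-\alpha_t)\left[ F(x_t,\gamma_t) + (\gamma_t - \gamma_{t+1})D_U \right] - F(x,\gamma_{t+1}) \\
&\quad + \alpha_t F(y_t,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&\le (1-\alpha_t)\left[ F(x_t,\gamma_t) - F(x,\gamma_t) \right] - \alpha_t F(x,\gamma_{t+1}) + (1-\alpha_t)(\gamma_t - \gamma_{t+1})D_U \\
&\quad + \alpha_t F(y_t,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2,
\end{aligned} \qquad (32)$$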
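where the last two inequalities are due to Lemma 9.

Denoting $\Delta_t := F(x_t,\gamma_t) - F(x,\gamma_t)$ and $\sigma_t(x) := \nabla F_t(x,\gamma_t) - \nabla F(x,\gamma_t)$ we can rewrite
(32) as:

$$\begin{aligned}
&\Delta_{t+1} - (1-\alpha_t)\Delta_t - (1-\alpha_t)(\gamma_t - \gamma_{t+1})D_U \\
&\le \alpha_t F(y_t,\gamma_{t+1}) - \alpha_t F(x,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&\le \alpha_t F(y_t,\gamma_{t+1}) - \alpha_t \left[ F(y_t,\gamma_{t+1}) + \langle \nabla F(y_t,\gamma_{t+1}), x - y_t \rangle + \frac{\mu}{2}\|x - y_t\|^2 \right] \\
&\quad + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&= -\alpha_t \left[ \langle \nabla F_{t+1}(y_t,\gamma_{t+1}) - \sigma_{t+1}(y_t), x - y_t \rangle + \frac{\mu}{2}\|x - y_t\|^2 \right] \\
&\quad + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&= -\alpha_t \left[ \langle \nabla F_{t+1}(y_t,\gamma_{t+1}), x - y_t \rangle + \frac{\mu}{2}\|x - y_t\|^2 + \frac{\theta_t}{2}\|x - v_t\|^2 \right] + \frac{\alpha_t\theta_t}{2}\|x - v_t\|^2 \\
&\quad + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 + \langle \sigma_{t+1}(y_t), \alpha_t(x - y_t) \rangle \\
&\le -\alpha_t \left[ \langle \nabla F_{t+1}(y_t,\gamma_{t+1}), v_{t+1} - y_t \rangle + \frac{\mu}{2}\|v_{t+1} - y_t\|^2 + \frac{\theta_t}{2}\|v_{t+1} - v_t\|^2 + \frac{\mu+\theta_t}{2}\|x - v_{t+1}\|^2 \right] \\
&\quad + \frac{\alpha_t\theta_t}{2}\|x - v_t\|^2 + \langle \nabla F(y_t,\gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \frac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&\quad + \langle \sigma_{t+1}(y_t), \alpha_t(x - y_t) \rangle,
\end{aligned} \qquad (33)$$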

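where the last inequality is due to Lemma 10 (taking $D(u, v) = \frac{1}{2}\|u - v\|^2$) and the definition of $v_{t+1}$:

$$
v_{t+1} := \arg\min_x \; \langle \nabla F_{t+1}(y_t, \gamma_{t+1}),\, x - y_t \rangle + \frac{\mu}{2}\|x - y_t\|^2 + \frac{\theta_t}{2}\|x - v_t\|^2. \tag{34}
$$

Minimizing the above directly leads to Line 4 of Alg. 1:

$$
v_{t+1} = \frac{\theta_t v_t + \mu y_t - \nabla F_{t+1}(y_t, \gamma_{t+1})}{\mu + \theta_t}. \tag{35}
$$

Based on this updating rule, it is easy to verify the following inequality:

$$
\begin{aligned}
-\frac{\alpha_t \mu}{2}\|v_{t+1} - y_t\|^2 - \frac{\alpha_t \theta_t}{2}\|v_{t+1} - v_t\|^2
&= -\frac{\alpha_t}{2(\mu + \theta_t)} \left[ \mu \theta_t \|v_t - y_t\|^2 + \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2 \right] \\
&\le -\frac{\alpha_t}{2(\mu + \theta_t)} \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2.
\end{aligned}
\tag{36}
$$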
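
The closed form for $v_{t+1}$ and the bound on the two squared-norm terms can be sanity-checked numerically. The sketch below uses random vectors in place of $v_t$, $y_t$ and the stochastic gradient, and arbitrary positive values for $\alpha_t$, $\mu$, $\theta_t$; all of these, and the helper names, are assumptions made for illustration only.

```python
import random

def v_update(v, y, g, mu, theta):
    """Closed-form minimizer: (theta*v + mu*y - g) / (mu + theta)."""
    return [(theta * vi + mu * yi - gi) / (mu + theta) for vi, yi, gi in zip(v, y, g)]

def sq_norm(a):
    return sum(ai * ai for ai in a)

def sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

def surrogate(x, v, y, g, mu, theta):
    """<g, x - y> + mu/2 ||x - y||^2 + theta/2 ||x - v||^2."""
    d = sub(x, y)
    return (sum(gi * di for gi, di in zip(g, d))
            + 0.5 * mu * sq_norm(d) + 0.5 * theta * sq_norm(sub(x, v)))

rng = random.Random(1)
dim, alpha, mu, theta = 4, 0.5, 0.3, 1.7  # hypothetical values
v, y, g = ([rng.gauss(0, 1) for _ in range(dim)] for _ in range(3))
v_next = v_update(v, y, g, mu, theta)

# First-order optimality: g + mu (x - y) + theta (x - v) vanishes at x = v_next.
grad = [gi + mu * (xi - yi) + theta * (xi - vi)
        for gi, xi, yi, vi in zip(g, v_next, y, v)]
grad_norm = sq_norm(grad) ** 0.5

# No random perturbation of v_next should achieve a lower surrogate value.
base = surrogate(v_next, v, y, g, mu, theta)
is_min = all(
    surrogate([xi + rng.gauss(0, 0.1) for xi in v_next], v, y, g, mu, theta) >= base
    for _ in range(200)
)

# Squared-norm terms versus the bound -alpha / (2 (mu + theta)) * ||g||^2.
lhs = -alpha * (0.5 * mu * sq_norm(sub(v_next, y))
                + 0.5 * theta * sq_norm(sub(v_next, v)))
rhs = -alpha / (2 * (mu + theta)) * sq_norm(g)
```

The slack `rhs - lhs` equals $\frac{\alpha_t \mu \theta_t}{2(\mu + \theta_t)}\|v_t - y_t\|^2$, which is why the bound is tight exactly when $\mu = 0$ or $v_t = y_t$.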



To set $x_{t+1}$ (Line 3 of Alg. 1), we follow classic stochastic gradient descent, so that $\|x_{t+1} - y_t\|^2$ can be bounded in terms of $\|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2$: $x_{t+1} = y_t - \eta_t \nabla F_{t+1}(y_t, \gamma_{t+1})$. Hence

$$
\|x_{t+1} - y_t\|^2 = \eta_t^2 \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2, \tag{37}
$$

and

$$
\begin{aligned}
\langle \nabla F(y_t, \gamma_{t+1}),\, x_{t+1} - y_t \rangle
&= \langle \nabla F_{t+1}(y_t, \gamma_{t+1}) - \sigma_{t+1}(y_t),\, x_{t+1} - y_t \rangle \\
&\le -\eta_t \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2 + \eta_t \|\sigma_{t+1}(y_t)\| \cdot \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|.
\end{aligned}
\tag{38}
$$



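Inserting (35), (36), (37) and (38) into (33) we have

$$
\begin{aligned}
\Delta_{t+1} &\le (1 - \alpha_t)\Delta_t + (1 - \alpha_t)(\gamma_t - \gamma_{t+1}) D_U \\
&\quad + \frac{\alpha_t}{2}\left[ \theta_t \|x - v_t\|^2 - (\mu + \theta_t)\|x - v_{t+1}\|^2 \right] + \langle \sigma_{t+1}(y_t),\, \alpha_t (x - y_t) + (1 - \alpha_t)(x_t - y_t) \rangle \\
&\quad + \eta_t \|\sigma_{t+1}(y_t)\| \cdot \|\nabla F_{t+1}(y_t, \gamma_{t+1})\| + \left( \frac{\alpha_t}{2(\mu + \theta_t)} + \frac{L_{F_{t+1}}}{2}\eta_t^2 - \eta_t \right) \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2 \\
&\quad + \left\langle \nabla F_{t+1}(y_t, \gamma_{t+1}),\, -\frac{\alpha_t \theta_t}{\mu + \theta_t}(v_t - y_t) - (1 - \alpha_t)(x_t - y_t) \right\rangle.
\end{aligned}
\tag{39}
$$

Setting the last term to zero, i.e. $-\frac{\alpha_t \theta_t}{\mu + \theta_t}(v_t - y_t) - (1 - \alpha_t)(x_t - y_t) = 0$, recovers the updating rule of $y_t$ (Line 1 of Alg. 1). Hence our result follows.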
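
The three update rules that appear in this proof can be put together into a single iteration. The following is a minimal sketch, not the authors' implementation: the rule for $y_t$ is obtained by solving $\frac{\alpha_t \theta_t}{\mu + \theta_t}(v_t - y_t) + (1 - \alpha_t)(x_t - y_t) = 0$ for $y_t$, the gradient oracle `grad_fn` stands in for $\nabla F_{t+1}(\cdot, \gamma_{t+1})$, and the toy quadratic and the parameter schedule in `run` ($\mu = 0$, $\alpha_t = 2/(t+2)$, $\theta_t = L\alpha_t$, $\eta_t = \alpha_t/\theta_t$) are hypothetical choices used only to exercise the code with exact gradients.

```python
def ansgd_step(x, v, grad_fn, alpha, theta, eta, mu=0.0):
    """One iteration assembled from the y, x and v update rules."""
    # y solves alpha*theta/(mu+theta) * (v - y) + (1 - alpha) * (x - y) = 0.
    denom = theta + mu * (1.0 - alpha)
    y = [((1.0 - alpha) * (mu + theta) * xi + alpha * theta * vi) / denom
         for xi, vi in zip(x, v)]
    g = grad_fn(y)
    # Plain gradient step from y.
    x_next = [yi - eta * gi for yi, gi in zip(y, g)]
    # Closed-form minimizer of the quadratic surrogate.
    v_next = [(theta * vi + mu * yi - gi) / (mu + theta)
              for vi, yi, gi in zip(v, y, g)]
    return x_next, v_next, y

CURV = [1.0, 0.5, 0.2]  # hypothetical curvatures; smoothness constant L = 1

def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURV, x))

def grad(x):
    return [a * xi for a, xi in zip(CURV, x)]

def run(steps, x0, L=1.0):
    """Noise-free run with illustrative parameters (mu = 0)."""
    x, v = list(x0), list(x0)
    for t in range(steps):
        alpha = 2.0 / (t + 2)
        theta = L * alpha
        eta = alpha / theta  # equals 1/L here
        x, v, _ = ansgd_step(x, v, grad, alpha, theta, eta)
    return x

x0 = [1.0, -2.0, 3.0]
x_final = run(200, x0)
```

With these particular parameters the iteration reduces to a standard three-sequence accelerated gradient scheme; in the stochastic setting of the paper, `grad_fn` would return a noisy gradient of the smoothed loss and the parameters would follow the theorems.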



Appendix C. Proof of Theorem 6

Proof It is easy to verify that by taking at = j^, 7 t+i = at and 6t = L g at + + -J2=. 



we have $\forall t \ge 1$:

$$
(1 - \alpha_{t-1})(\gamma_{t-1} - \gamma_t) \le \gamma_t - \gamma_{t+1}, \tag{40}
$$

and 

1 > 2(0 t -i - a t -iEL t ) " 2(0* - a 4 EL m ) ' 1 j 
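
Inequality (40) can be checked with exact rational arithmetic. The sketch below assumes the schedule $\alpha_t = \frac{2}{t+2}$ with $\gamma_{t+1} = \alpha_t$ (so $\gamma_t = \frac{2}{t+1}$); this schedule is an assumption of the example, and the function names are hypothetical.

```python
from fractions import Fraction

def alpha(t):
    """Assumed step-size schedule alpha_t = 2 / (t + 2)."""
    return Fraction(2, t + 2)

def gamma(t):
    """gamma_{t+1} = alpha_t, hence gamma_t = alpha_{t-1} = 2 / (t + 1)."""
    return alpha(t - 1)

def slack_40(t):
    """Right side of (40) minus its left side; nonnegative when (40) holds."""
    lhs = (1 - alpha(t - 1)) * (gamma(t - 1) - gamma(t))
    rhs = gamma(t) - gamma(t + 1)
    return rhs - lhs

all_ok = all(slack_40(t) >= 0 for t in range(1, 2001))
```

Under this schedule the slack simplifies to $\frac{4}{t(t+1)^2(t+2)}$, so the inequality holds strictly for every $t \ge 1$.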

Next we define and bound weighted sums of D 2 that will be used later. 

(t) := Mt - (1 - otjot-i^i] A 2 + (1 - a t ) [at-i^t-l - (1 - ^-1)^-2^-2] A-1 + 
(1 - a t )(l - a t _i) [at-2^-2 - (1 - at-2)«t-3^-3] A 2 -2 H > 

(42) 

where, replacing $\alpha_t$ and $\theta_t$ by their definitions, we have $\forall t$:



(i + l)2( t + 2)2 (t+l)(t + 2) (t + l)(t + 2) 

(43) 






Substituting (43) into (42) and invoking the definition of $D^2$, we have $\forall t$:



< AL g D 2 
2E\\A^\\ 2 D 2 



1 



+ 



t(t + l) 



1 



+ 



(i-l)t 



1 



(t + l)2(i + 2)2 (t+l)(i + 2)*2(t+l)2 (t + l)(t + 2) (t- 1)2 £ 2 



+ ••• 



+ 



c 



1 



+ 



t(t + 1) 



1 



+ 



(t - l)t 



1 



(* + !)(* + 2) (t +!)(* + 2) + (t + l)(t + 2) (t-l)t 



+ 



+ \/2fi£> 



(t + l)vT+2 - iVt+T i(t + l) tVF+T- (t -!)>/* 



(t + l)(t + 2) (t + l)(i + 2) 

(t-l)t (t-l)v / t-(t-2) v / ^T 



t(t+l) 



+ 



+ 



(t + l)(* + 2) 

(t + + 2) 
2E||^|| 2 J D 2 
C 

V2VD 2 
(t + !)(* + 2) 



(t-1)* 
1 1 



+ 



i + 1 i + 2 
1 



+ I - 



+ 



t t + 1 



+ 



+ 



1 



f - 1 t 
1 



r 



+ 



(t + l)(t + 2) (t+l)(* + 2) (t + l)(t + 2) 

(t + + 2 - tVt+1 + tVt + i - (t - + (t - - (t - 2)V* - 1 + 



< a t 6 t D 2 . 



(44) 



Since $\mu = 0$, by recursively applying (13) and noting that $1 - \alpha_0 = 0$, we have



EA m < (1 - a t )EA t + a t 8 t (D 2 - D 2 t+1 ) + 



at 



-a 2 + (1 - a t )(j t ~ lt+i)D u 



2(9 t - a t EL t+1 ] 

< (1 - a t )(l - a t _i)EA t _! + a t t (D t 2 - D t 2 +1 ) + (1 - a t ) at-i^-i (^ 2 _l " A") + 
2a t 2 



2(0* - a t EL m 



V + 2(l-a t )(7t-7t+i)A/ 



< 



H ^2 

< Mt£ 2 + 



2(0 t - ajELt+i) 
2 A, 



t - a 4 EL t+1 + t + 2 

7%t(j 2 2D, 



[a t 2 EL m + fi^] # 2 + 



n 



+ 



t + 2 



(45) 



Combining with Lemma 3 we have, $\forall x$:



E [*(xt+i) - $(x)] < [a t 2 EL m + Oy^] # 2 + 



< a 2 L g D + 7m + 



t + 2 



a t a 2 ,2Du 

ir + TT2 + ^ t+lDu 

7*+iC V " 



(46) 
Taking $\gamma_{t+1} = \alpha_t = \frac{2}{t+2}$, our result follows.



Appendix D. Proof of Theorem 7

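Proof It is easy to verify that by taking $\alpha_t = \frac{2}{t+1}$, we have $\forall t \ge 1$

$$
(1 - \alpha_{t-1})(\gamma_{t-1} - \gamma_t) \le \gamma_t - \gamma_{t+1} \tag{47}
$$

and

$$
(1 - \alpha_t)\,\alpha_{t-1}^2 \le \alpha_t^2. \tag{48}
$$

Denote

$$
S_t := \alpha_t \theta_t - (1 - \alpha_t)(\alpha_{t-1}\theta_{t-1} + \mu \alpha_{t-1}). \tag{49}
$$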
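
The step-size relations used in this proof lend themselves to an exact check. The sketch below assumes $\alpha_t = \frac{2}{t+1}$ and $\gamma_{t+1} = \alpha_t$ (so $\gamma_t = \frac{2}{t}$); because $\gamma_0$ is not defined under this reading, (47) is checked from $t = 2$ onward, which is a choice of the example. It also evaluates the telescoping product $\prod_{i=s}^{t}(1 - \alpha_i) = \frac{(s-1)s}{t(t+1)}$, which is what produces the $(t+1)t$ denominators later in the proof. Function names are hypothetical.

```python
from fractions import Fraction

def alpha(t):
    """Assumed schedule alpha_t = 2 / (t + 1)."""
    return Fraction(2, t + 1)

def gamma(t):
    """gamma_{t+1} = alpha_t, hence gamma_t = 2 / t for t >= 1."""
    return alpha(t - 1)

def holds_47(t):
    """(1 - alpha_{t-1})(gamma_{t-1} - gamma_t) <= gamma_t - gamma_{t+1}, t >= 2."""
    return (1 - alpha(t - 1)) * (gamma(t - 1) - gamma(t)) <= gamma(t) - gamma(t + 1)

def holds_48(t):
    """(1 - alpha_t) * alpha_{t-1}^2 <= alpha_t^2, t >= 1."""
    return (1 - alpha(t)) * alpha(t - 1) ** 2 <= alpha(t) ** 2

def tail_product(s, t):
    """Product of (1 - alpha_i) for i = s, ..., t."""
    p = Fraction(1)
    for i in range(s, t + 1):
        p *= 1 - alpha(i)
    return p

ok_47 = all(holds_47(t) for t in range(2, 2001))
ok_48 = all(holds_48(t) for t in range(1, 2001))
```

Since $1 - \alpha_1 = 0$, any product that includes the index $i = 1$ vanishes, which is how the dependence on the initial gap disappears in the recursion.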



Taking 9 t = L g a t + + 



E||-4gll : 



2at 1 C 



St = 4L n 



it is easy to verify that Vt > 1: 



1 2E[|^| 



+ 



i i 



t t + i 



t + i 



(50) 



We want to find the smallest iteration index $C$ such that $S_t \le 0$ whenever $t \ge C$. Without any knowledge about $L_g$ and $\mathbb{E}\|A_\xi\|^2$, minimizing $S_t$ w.r.t. $t$ does not yield an analytic form of $C$. Hence we simply let

$$
\frac{4 L_g}{(t+1)^2 t^2} \le \frac{\mu}{2(t+1)} \tag{51}
$$

and

$$
\frac{2\,\mathbb{E}\|A_\xi\|^2}{\varsigma}\left( \frac{1}{t} - \frac{1}{t+1} \right) \le \frac{\mu}{2(t+1)}. \tag{52}
$$

Inequality (51) is satisfied when

$$
t \ge 2\left( \frac{L_g}{\mu} \right)^{1/3}, \tag{53}
$$

and (52) is satisfied when

$$
t \ge \frac{4\,\mathbb{E}\|A_\xi\|^2}{\varsigma \mu}. \tag{54}
$$

Combining these two we reach the definition of $C$ in (15). Next we proceed to prove the bound.
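
The two sufficient conditions can be turned into a small calculation. The sketch below assumes the conditions have the form $\frac{4L_g}{t^2(t+1)^2} \le \frac{\mu}{2(t+1)}$ and $\frac{2\,\mathbb{E}\|A_\xi\|^2}{\varsigma}\left(\frac{1}{t} - \frac{1}{t+1}\right) \le \frac{\mu}{2(t+1)}$, and takes the threshold to be the ceiling of the larger of $2(L_g/\mu)^{1/3}$ and $4\,\mathbb{E}\|A_\xi\|^2/(\varsigma\mu)$. The exact definition of $C$ in the theorem may differ, and the function names and numeric constants are hypothetical.

```python
import math

def cond_lg(t, L_g, mu):
    """4 L_g / (t^2 (t+1)^2) <= mu / (2 (t+1))."""
    return 4.0 * L_g / (t * t * (t + 1) ** 2) <= mu / (2.0 * (t + 1))

def cond_a(t, EA2, varsigma, mu):
    """(2 E||A||^2 / varsigma) (1/t - 1/(t+1)) <= mu / (2 (t+1))."""
    return 2.0 * EA2 / varsigma * (1.0 / t - 1.0 / (t + 1)) <= mu / (2.0 * (t + 1))

def threshold(L_g, mu, EA2, varsigma):
    """Smallest integer at or above both sufficient lower bounds on t."""
    t1 = 2.0 * (L_g / mu) ** (1.0 / 3.0)
    t2 = 4.0 * EA2 / (varsigma * mu)
    return max(1, math.ceil(max(t1, t2)))

# Hypothetical problem constants.
L_g, mu, EA2, varsigma = 10.0, 0.1, 2.03, 1.0
C = threshold(L_g, mu, EA2, varsigma)
both_hold = all(
    cond_lg(t, L_g, mu) and cond_a(t, EA2, varsigma, mu)
    for t in range(C, C + 500)
)
```

The second bound is exact (the condition holds if and only if $t \ge 4\,\mathbb{E}\|A_\xi\|^2/(\varsigma\mu)$), while the first is only sufficient, since it replaces $t^2(t+1)$ by the smaller quantity $t^3$.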
As defined in the theorem, we denote D 2 = max <j< m i n (£ j( 7) D 2 . By recursively applying 
13|) for < i < t and noticing that St < Vt > C, 1 — ai = we have 

113 * 

EA t+ i < [J(l-« i )Ao + (t + l)(l-a t )(7 t -7 t+ i)A/+ 

i=0 

[(a^t)-D? - (a t 9 t + fia t )D 2 +1 ] + 
(1 - at) [(at-^t-i)^?-! " (at-l^-l + M«t-l)A 2 ] + 
(1 - a f )(l - a t _i) [(a*-2^-2)A 2 -2 - («t-2^t-2 + ^a*-2)A 2 -i] + 

• • • + - ai) [(a 6>o)£>o - («o#o + va )Dj} + 

8=1 

i 

2 + (1 - a t )a 2 _ 1 H h - a*) a 2 , 



cr 
/i 



i=l 



< — y + -D 2 (1 - cti) [ac-2%-2 - (1 - a C -2)(ac-3^c-3 + /uac_ 3 )] + 



i=C-l 



^ ~ a ') [ ft c-3^c-3 - (1 - ftc-3)(ac-4#c*-4 + /xac-4)] + 

i=C-2 

l) 2 ]J(1 - fti) [fti^i - (1 - ai)(ft O 0o + M«o)] + 



i=2 



(55) 



Applying (|50p by ignoring the — jtj term to the above inequality we can bound the coeffi- 
cients of L 9 and Sl^H parts separately as follows. 
When t > C, for the L 9 part: 

nU-i(i - «i) „ n-=c7- 2 a-^) „ nU- 3 (i-^) . . , uuo- - «*) 



(C-l) 2 (C-2) 2 (C-2) 2 (C-3) 2 (C-3) 2 (C-4) 2 



2 2 . l2 



1 



(t+l)t 



1 



+ 



+ 



1 



(C + 2)(C + 1) (C + 1)C C(C-l)) 



+ ••• + 



2 • 1 



(56) 



c+i 



< 



(t + l)t ^ i 2 - 6t(t + 1) 



For the part: 



n* = c-i(!-«i 



C-2 C-l 



1 x +n* =c - 2 (i-«i) ' 



2 1 



C-l C-2 C-2 C-3 

+ T— TTT - ~ . — + ••• + 



ft; 



(t + l)t (t + l)t (t + l)t (t+l)t 

C-l 1 C-2 



(t + l)t (t + l)t 



(t + i)t (t + i)t t(t+iy 



(57) 
Combining with Lemma 3 and taking $\gamma_{t+1} = \alpha_t = \frac{2}{t+1}$ we have, $\forall x$:

,nr~, x \i 2L> W 2ir 2 L„D 2 2{C - 2)E|| A f l| 2 5 2 /C <r 2 

E $(x t+ i) - $(x) < + ; 9 \ + — 7 Y — + r + 7f+l£>« 

LV + > y n ~ t + l 3£(> + l) t(t+l) /i(t + l) ' + 

_ 2vr%Z) 2 + 2(C - 2)E||^|| 2 L) 2 /C + 4L> W | a 2 



3t(t + l) + t + l + 

(58) 

When < i < C, one can simply put C = t in the above, and this completes our proof. ■ 



Appendix E. Proof of Proposition 8
Proof 

t-i 



E €[t] fl(t) = E €[t] ^ [*(x<,e i+ i) - *(xUi+i)] 

i=0 

= % E { - *(**)] + ^( x *) - } 

= XX+n [*(**> &h) - <*K x *)] +Ec M E ^( x *) - +% E t*( x *) - *( x Um)] 

i=0 i=Q i=0 

t-1 t-1 

^ E% [i+11 [*(*>€i+i) - *( x *)] + % 53 [*(*t) - 

i=0 i=0 

t-i t-i 
= - *(**)] +Ee 53 [*(xf) - $(xU m )] . 



i=0 i=0 